Our project team is VARS Consulting. NBCUniversal has contracted VARS Consulting to analyze box office performance and make strategic recommendations for assessing the likely financial performance of its upcoming theatrical releases. As part of this consulting engagement, we analyzed box office data for both NBCUniversal and other production companies and used it to derive key insights. In this report we detail how we gathered and prepared the necessary information and satisfied the client’s request using a multiple linear regression model. Finally, we use the selected model to predict likely box office outcomes.
VARS Consulting utilized the following process for this engagement:
| imdbid | title | plot | rating | imdb_rating | metacritic | dvd_release | production | actors | imdb_votes | poster | director | release_date | runtime | genre | awards | keywords | Budget | Box.Office.Gross |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| tt0010323 | The Cabinet of Dr. Caligari | Hypnotist Dr. Caligari uses a somnambulist, Cesare, to commit murders. | UNRATED | 8.1 | N/A | 15-Oct-97 | Rialto Pictures | Werner Krauss, Conrad Veidt, Friedrich Feher, Lil Dagover | 42,583 | https://images-na.ssl-images-amazon.com/images/M/MV5BMTY1NzIxOTcxM15BMl5BanBnXkFtZTgwMjY0ODgwNzE@._V1_SX300.jpg | Robert Wiene | 19-Mar-21 | 67 min | Fantasy, Horror, Mystery | 1 nomination. | expressionism|somnambulist|avant-garde|hypnosis|fair|visit|murder|asylum|violence|opening-a-door|death|flashback-within-a-flashback|costume-horror|flashback|gothic|surrealism|enigma|kidnapping|good-versus-evil|sleepwalking|mind-control|macabre|carnival|mannequin|insanity|evil-doctor|diabolical|madman|megalomania|megalomaniac|tragic-villain|mad-scientist|sideshow|hypnotism|psychopath|somnambulism|psychiatrist|surprise-ending | $18,000 | 0 |
| tt0052893 | Hiroshima Mon Amour | A French actress filming an anti-war film in Hiroshima has an affair with a married Japanese architect as they share their differing perspectives on war. | NOT RATED | 8 | N/A | 24-Jun-03 | Rialto Pictures | Emmanuelle Riva, Eiji Okada, Stella Dassas, Pierre Barbaud | 21,154 | https://images-na.ssl-images-amazon.com/images/M/MV5BMjMyNDYzMzU5OV5BMl5BanBnXkFtZTgwNTUxNzU4MjE@._V1_SX300.jpg | Alain Resnais | 16-May-60 | 90 min | Drama, Romance | Nominated for 1 Oscar. Another 6 wins & 5 nominations. | memory|atomic-bomb|lovers-separation|impossible-love|nuclear-bomb|radiation-victim|nuclear-radiation|hiroshima-japan|post-war|nuclear-weapons|peace|german-soldier|actress|japanese-man|anti-war|first-love|oblivion|japan|nouvelle-vague|humanism|death-of-lover|separation|sleepless-night|voice-over-inner-thoughts|locked-in-a-cellar|tragic-love|traumatic-experience|20th-birthday|death-of-loved-one|death-of-boyfriend|adulterer|adulteress|adulterous-desire|memorial-park|war-memorial|peace-demonstration|film-in-film|nevers-france|two-in-a-shower|survival|archive-footage|museum|radiation-burn|radiation-poisoning|1950s|1940s|city-in-title|three-word-title|french-new-wave|obliviousness|first-person-title|cult-film|place-name-in-title|claim-in-title|part-documentary|asian-man-white-woman-relationship|world-war-two|nonlinear-timeline|jump-cut|flashback|extramarital-affair|surrealism|franco-japanese|hotel|hotel-room|frenchwoman|interracial-love|interracial-couple|interracial-romance|france | $88,300 | 0 |
| tt0058898 | Alphaville | A U.S. secret agent is sent to the distant space city of Alphaville where he must find a missing person and free the city from its tyrannical ruler. | NOT RATED | 7.2 | N/A | 20-Oct-98 | Rialto Pictures | Eddie Constantine, Anna Karina, Akim Tamiroff | 17,801 | https://images-na.ssl-images-amazon.com/images/M/MV5BMzk2MTlkM2EtNzNhYi00Y2YxLWIwODktNGQ0NDM2ZTgwODJiXkEyXkFqcGdeQXVyNjI5NTk0MzE@._V1_SX300.jpg | Jean-Luc Godard | 5-May-65 | 99 min | Drama, Mystery, Sci-Fi | 1 win. | dystopia|french-new-wave|satire|comic-violence|surrealism|nouvelle-vague|avant-garde|neo-noir|secret-agent|lemmy-caution|future|dictionary|conscience|computer|alternate-reality|hard-boiled|spying|spy-hero|spy|reference-to-james-bond|eurospy|espionage|french-science-fiction|car-chase|social-satire|ford-mustang|violence|swimming-pool|spoof|sexism|science-runs-amok|riddle|neon|negative-footage|mind-control|mathematical-equation|gun-violence|galaxy|forbidden-speech|evil-computer|public-execution|totalitarianism|philosophy|bible|nudity|utopia-quest|dictator|detective|metropolis|artificial-intelligence|spiral-staircase|based-on-novel|character-name-in-title | $220,000 | $46,585 |
| tt0074252 | Ugly, Dirty and Bad | Four generations of a family live crowded together in a cardboard shantytown shack in the squalor of inner-city Rome. They plan to murder each other with poisoned dinners, arson, etc. The … | N/A | 7.9 | N/A | 1-Nov-16 | Compagnia Cinematografica Champion | Nino Manfredi, Maria Luisa Santella, Francesco Anniballi, Maria Bosco | 5,705 | https://images-na.ssl-images-amazon.com/images/M/MV5BMTEwMzkwMDgxNTdeQTJeQWpwZ15BbWU4MDc3MzM1MzAy._V1_SX300.jpg | Ettore Scola | 23-Sep-76 | 115 min | Comedy, Drama | 1 win & 2 nominations. | incest|failed-murder-attempt|poisoned-food|baptism|planning-a-murder|woman-in-a-wheelchair|cantankerous-old-woman|tv-reporter|domestic-violence|woman-with-mustache|brother-in-law-sister-in-law-sex|avarice|cupidity|dysfunctional-family|burnt-face|sexual-promiscuity|money-roll|misery|family-patriarch|traveling-salesman|promiscuity|drunkenness|greed|scooter|nude-model|nude-photograph|large-family|360-degree-pan|vomiting|sex|lumpenproletariat|consumerism|dream|black-comedy|sea|pregnancy|extramarital-affair|prostitute|male-prostitute|crossdresser|poisoning|long-take|rome-italy|family-relationships|commedia-all’italiana|woman-on-top|slum|italy|1970s|poison|burlesque|poverty | $6,590 | 0 |
| tt0084269 | Losing Ground | A comedy-drama about a Black American female philosophy professor and her insensitive, philandering, and flamboyant artist husband who are having a marital crisis. When the wife goes off on… | N/A | 6.3 | N/A | N/A | Milestone Film & Video | Billie Allen, Gary Bolling, Clarence Branch Jr., Joe Garcia | 132 | https://images-na.ssl-images-amazon.com/images/M/MV5BMTUwMzQzNDg0MV5BMl5BanBnXkFtZTgwMDgwMDUwODE@._V1_SX300.jpg | Kathleen Collins | 1-Jun-82 | 86 min | Comedy, Drama | N/A | artist|painter|marriage|black-independent-film|independent-film|professor|f-rated|written-by-director|title-directed-by-female|female-director|swimming-pool|swimming|filming|kissing|painting|two-word-title|black-middle-class|middle-class|middle-age-couple|love-triangle|philosopher | 0 | 0 |
| tt0085180 | L’argent | A forged 500-franc note is cynically passed from person to person and shop to shop, until it falls into the hands of a genuine innocent who doesn’t see it for what it is - which will have … | N/A | 7.5 | 95 | 24-May-05 | Criterion Collection | Christian Patey, Vincent Risterucci, Caroline Lang, Sylvie Van den Elsen | 5,607 | https://images-na.ssl-images-amazon.com/images/M/MV5BY2RlYTc2ZGUtMGFlNS00ZGUxLWEzODYtYjJhY2RmOWRkYzY4L2ltYWdlXkEyXkFqcGdeQXVyNDQzMDg4Nzk@._V1_SX300.jpg | Robert Bresson | 18-May-83 | 85 min | Crime, Drama | 2 wins & 3 nominations. | note|murder|solitary-confinement|robbery|delivery-man|camera|bank-robbery|money|pushing|table|objectified-woman|unreliable-employee|woman-in-bed|old-woman-murdered|murdered-in-a-bed|pretending-to-take-medicine|wristwatch|working-class|wine|whiskey-bottle|wheelchair|wheelbarrow|wealth|washing-clothes|waiter|vengeance|valium|trial|toy-store|thief|theft|telephone-call|teenage-boy|teacher|suitcase|subway|stranger|sticking-out-one’s-tongue|stealing|stakeout|sorting-mail|sleeping|sleeping-in-a-barn|sister-sister-relationship|sink|sidewalk-cafe|shopkeeper|shop-window|serving-ladle|searching|scrapbook|school|schoolboy|running|returned-mail|return-to-sender|restaurant|release-from-prison|redemption|reckless-driving|pursuit|purse|punishment|prisoner|prison-visitation|prison-guard|prison-discharge|prison-cell|prison-break|prison-alarm|priest|prank|policeman|police-van|police-station|police-car|pitchfork|pill|picture-frame|piano|piano-teacher|piano-player|photographer|perjury|passing-note|pajamas|pacing|newspaper|murderer|murder-of-family|mother-son-relationship|mother-daughter-relationship|moped|metro|mass|mass-murder|map|mail|magazine|loan|lie|liar|letter|letter-censorship|lawyer|lantern|knife|knees|key|judge|ironing|invoice|investigator|investigation|imprisonment|husband-wife-relationship|hunger|humiliation|hotel|hospital|helmet|heart-monitor|headmaster|handcuffs|gun|gunshot|guilt|greed|getaway-car|gas|gas-delivery-man|garden|friend|friendship|fraud|france|forgiveness|forgery|footbridge|food|following|floor-polisher|fleeing|father-son-relationship|father-daughter-relationship|family-relationships|pretending-to-take-a-pill|face-slap|escape|fired-from-the-job|elevator|drink|drinking|drain|dog|dining-hall|digging-for-potatoes|death|death-of-husband|death-of-daughter|darkroom|custody|cross|court|courtroom|corruption|confession|coffee|clothes-line|class|classroom|cigarette-smoking|check|chase|charity|cell-mate|catholic|catholic-church|cash-register|car-accident|camera-store|cafe|burglary|burglar-alarm|broken-glass|broken-dish|breaking-and-entering|bread|blood|betrayal|beating|bakery|axe-murder|arrest|ambulance|alarm|accusation|accomplice|suicide-attempt|sleeping-pills|photo-shop|multiple-murder|master-key|foreign-language-adaptation|false-testimony|chain-reaction|based-on-short-story|scam|ex-convict|counterfeit|police|axe|prison|death-of-child|female-nudity | 0 | 0 |
| variable | description |
|---|---|
| imdbid | Unique ID used by IMDB to refer to the movie. |
| title | Title of the movie |
| plot | Movie plot summary |
| rating | MPAA appropriate audience rating |
| imdb_rating | IMDb voters’ score for a movie on a scale from 1-10 (10 being best) |
| metacritic | Metacritic movie score on a scale of 0-100 (100 being best) |
| dvd_release | Movie release date on DVD |
| production | Principal production company |
| actors | Lead actors |
| imdb_votes | Total votes from IMDB members. |
| poster | Movie poster artwork |
| director | Movie director |
| release_date | Theatrical release date |
| runtime | Runtime length of movie in minutes |
| genre | Genre classification |
| awards | Academy awards & nominations |
| keywords | Keywords associated with the movie |
| Budget | Budget spent on the movie production, marketing, and distribution. |
| box office gross | Box office gross returns as of 9/21/2017 |
| Season.Start | Season.End | Box.Office.Season | Season.Gross | Season.YoY | Season.Days | Season.Daily.Avg | Season.Movie.Count | Season.Move.Avg.Gross |
|---|---|---|---|---|---|---|---|---|
| 1/5/07 | 3/1/07 | Winter | 890.1 | -0.024 | 58 | 15.3 | 70 | 12.7 |
| 3/2/07 | 5/3/07 | Spring | 1342.3 | -0.002 | 62 | 21.7 | 116 | 11.6 |
| 5/4/07 | 9/3/07 | Summer | 4210.5 | 0.128 | 122 | 34.5 | 218 | 19.3 |
| 9/4/07 | 11/1/07 | Fall | 947.9 | -0.138 | 58 | 16.3 | 120 | 7.9 |
| 11/2/07 | 1/4/08 | Holiday | 2299.9 | 0.082 | 60 | 38.3 | 107 | 21.5 |
| 1/5/08 | 3/6/08 | Winter | 1052.7 | 0.183 | 64 | 16.4 | 99 | 10.6 |
| imdbid | title | plot | rating | imdb_rating | metacritic | dvd_release | production | actors | imdb_votes | poster | director | release_date | runtime | genre | awards | Budget | Box.Office.Gross | Box.Office.Season | Season.Gross | Season.YoY.Change | Season.Days | Season.Daily.Avg | Season.Movie.Count | Season.Movie.Avg | keywords |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| tt0200465 | The Bank Job | Martine offers Terry a lead on a foolproof bank hit on London’s Baker Street. She targets a roomful of safe deposit boxes worth millions in cash and jewelry. But Terry and his crew don’t realize the boxes also contain a treasure trove of dirty secrets - secrets that will thrust them into a deadly web of corruption and illicit scandal. | R | 7.3 | 69 | 15-Jul-08 | Lionsgate | Jason Statham, Saffron Burrows, Stephen Campbell Moore, Daniel Mays | 158,562 | https://images-na.ssl-images-amazon.com/images/M/MV5BMTUwMzc1MDMxOV5BMl5BanBnXkFtZTcwODY4OTIzMw@@._V1_SX300.jpg | Roger Donaldson | 7-Mar-08 | 111 min | Crime, Drama, Romance | 3 nominations. | 20000000 | 30028592 | Spring | 1074900000 | -0.20 | 55 | 19543636 | 111 | 9700000 | safe-deposit|heist|chase|mobster|london-england|bank|pornographer|blackmail|secret-service|bank-vault|crooked-policeman|heist-movie|bank-heist|walkie-talkie|torture|robbery|tunnel|murder|car-dealer|brothel|airport|train-station|revolver|weapon|ford-transit|ford|van|car-salesman|based-on-true-events|woman|20th-century|england|united-kingdom|champagne|red-wine|beer|female-frontal-nudity|gunfight|shootout|fistfight|sex-scene|kissing-while-having-sex|kiss|neo-noir|violence|incriminating-photograph|s&m|casual-sex|black-activist|paparazzi|year-1971|period-piece|dutch-angle|debt|peace-sign|masochism|doublecross|extortion|recruiting|planning|stripper|caper|mafia|suffocation|strangulation|stabbed-in-the-back|pistol|double-cross|death|window-smashing|what-happened-to-epilogue|wedding-reception|vulgarity|trinidad|tailor|subway|strip-club|stabbing|shot-in-the-head|rooftop|revolutionary|railway-station|pub|prologue|princess|pornographic-film|political-corruption|police-corruption|photograph|parking-garage|open-grave|nonlinear-timeline|menage-a-trois|marriage|machete|jackhammer|infidelity|hidden-camera|ham-radio|fishing-boat|female-nudity|fashion-model|drug-smuggling|customs|courtroom|caribbean|cabinet-officer|brick|bracelet|book-party|beach|basement|assault|ambulance|1970s|dominatrix|death-of-friend|based-on-true-story |
| tt0315642 | Wazir | A grief-stricken cop and an amputee grandmaster are brought together by a peculiar twist of fate as part of a wider conspiracy that has darkened their lives. | N/A | 7.2 | N/A | N/A | Rajkumar Hirani Films | Amitabh Bachchan, Farhan Akhtar, Aditi Rao Hydari, Manav Kaul | 12,764 | https://images-na.ssl-images-amazon.com/images/M/MV5BMTUzNDU4NDMyOV5BMl5BanBnXkFtZTgwNjcyNzU0NzE@._V1_SX300.jpg | Bejoy Nambiar | 8-Jan-16 | 103 min | Action, Crime, Drama | 1 nomination. | 586028 | 0 | Winter | 1144900000 | 0.02 | 59 | 19405085 | 86 | 13300000 | chess-grandmaster|chess|race-against-time|one-word-title|character-name-in-title |
| tt0323808 | The Wicker Tree | Charmed by the residents of Tressock, Scotland, two young missionaries accept the invitation to participate in a local festival, fully unaware of the consequences of their decision. | R | 3.9 | N/A | 24-Apr-12 | Anchor Bay Entertianment | Brittania Nicol, Henry Garrett, James Mapes, Lesley Mackie | 2,155 | https://images-na.ssl-images-amazon.com/images/M/MV5BMTkyNzkyODE5N15BMl5BanBnXkFtZTcwNjUxNzIxNw@@._V1_SX300.jpg | Robin Hardy | 27-Jan-12 | 96 min | Drama, Horror | N/A | 7750000 | 0 | Winter | 1243900000 | 0.36 | 58 | 21446552 | 88 | 14100000 | sex-scene|female-nudity|folk-horror|british-horror|supernatural-horror|three-word-title|plant-in-title|satire|black-comedy|second-part|sequel |
| tt0326965 | In My Sleep | Marcus is a popular massage therapist who struggles with parasomnia, a severe sleepwalking disorder that causes him to do things in his sleep that he cannot remember the next day. When he … | PG-13 | 5.6 | 33 | 1-Oct-10 | Morning Star Pictures | Philip Winchester, Tim Draxl, Lacey Chabert, Abigail Spencer | 1,741 | https://images-na.ssl-images-amazon.com/images/M/MV5BNzg1MDM1NzIwMV5BMl5BanBnXkFtZTcwNzMxMTU1MQ@@._V1_SX300.jpg | Allen Wolf | 23-Apr-10 | 104 min | Drama, Mystery, Thriller | 6 wins. | 1000000 | 57190 | Spring | 1626800000 | 0.30 | 62 | 26238710 | 93 | 17500000 | knife|flashback|falling-down-stairs|policeman|cemetery|surprise-party|spa|swimming-pool|handcuffs|man-in-swimsuit|beefcake|bare-chested-male|vegetarian|vegan|independent-film |
| tt0327597 | Coraline | An adventurous girl finds another world that is a strangely idealized version of her frustrating home, but it has sinister secrets. | PG | 7.7 | 80 | 21-Jul-09 | Focus Features | Dakota Fanning, Teri Hatcher, Jennifer Saunders, Dawn French | 159,786 | https://images-na.ssl-images-amazon.com/images/M/MV5BMzQxNjM5NzkxNV5BMl5BanBnXkFtZTcwMzg5NDMwMg@@._V1_SX300.jpg | Henry Selick | 6-Feb-09 | 100 min | Animation, Family, Fantasy | Nominated for 1 Oscar. Another 7 wins & 43 nominations. | 60000000 | 75286229 | Winter | 1227100000 | 0.17 | 62 | 19791935 | 64 | 19200000 | parallel-worlds|stop-motion|scissors|new-home|eye|dream|secret-door|cat|rescue|talking-cat|spiderweb|seashell-bikini|spiral-staircase|scene-after-end-credits|puppet-animation|garment-button|lifting-someone-into-the-air|thunderstorm|monster|crying|shadow|loneliness|boat|rainbow|blood|orphan|woods|forename-as-title|one-word-title|bechdel-test-passed|husband-wife-relationship|little-girl|moving-crew|pet-cat|stuffed-animal|pet-dog|old-woman|thunder|letter|bicycling|little-boy|dowsing|forest|movers|search|clue|flashlight|fear|riddle|tears|fireplace|stage|anger|spotlight|beetle|big-top|child-protagonist|female-protagonist|search-for-parent|sleep|void|alternate-world|camera|walker|stabbing-a-doll|cane|hanging-from-a-flagpole|angel|snow|eccentric|pizza|lemonade|stuffed-toy-dog|moving|swinging-on-a-door|stuffed-animal-toy|reflection|motorcycle|other-father|tea|catalogue|gloves|clothing-store|toy-chest|texas|michigan|suitcase|rain|hummingbird|flowers|breaking-mirror|candy|disappearance-of-one’s-father|disappearance|running-away|dinner|theatre-audience|theatre-production|diving-into-a-barrel|song|singing|singer|shakespearean-quotation|reference-to-william-shakespeare|lightning|moving-van|mermaid|actress|cheese|balancing-on-a-balcony-railing|balcony|parachute|knitting-needle|mirror|computer|voice-over-narration|wallpaper|corset|cell-phone|dog|oregon|player-piano|pianist|eyeglasses|photograph|skipping-stone|milkshake|tent|pajamas|presidents’-day|lorgnette|hide-and-seek|cake|candle|bedroom|nightmare|sleeping|prayer|eating|food|buxom|boy|mother-daughter-relationship|father-daughter-relationship|sewing|thread|needle|horror-for-children|dark-fantasy|theatre|blue-hair|tunnel|trapeze|toy-train|tickling|tea-leaves|snowglobe|slug|praying-mantis|peeling-skin|mouse|high-dive|garden|fortune-telling|fog|dowsing-rod|cannon|bicycle|bat|poison-oak|mirror-as-portal|blowing-a-raspberry|well-shaft|rat|old-mansion|metamorphosis|mechanical-hand|insect|grandmother-grandson-relationship|eye-cut-out|cat-and-mouse|bug|acrobat|impostor|talking-animal|secret-passage|piano|old-dark-house|neighbor|kidnapping|key|ghost|ghost-child|game-playing|fantasy-world|doll|cotton-candy|circus|surrealism|3-dimensional|cult-film|stop-motion-animation|death-of-mother|based-on-novel|title-spoken-by-character|character-name-in-title|laptop-computer |
| tt0337584 | Backseat | A “coming of age” story where two old friends flee from New York City on a three-day road trip to Montreal, Canada to escape their problems. | N/A | 7.0 | 32 | N/A | Truly Indie | Josh Alexander, Starla Benford, William Bogert, Robert T. Bogue | 69 | https://images-na.ssl-images-amazon.com/images/M/MV5BMTUwMDk1ODczMl5BMl5BanBnXkFtZTcwMzU2OTc1MQ@@._V1_SX300.jpg | Bruce Van Dusen | 28-Mar-08 | 80 min | Comedy | 1 win. | 12343 | 0 | Spring | 1074900000 | -0.20 | 55 | 19543636 | 111 | 9700000 | road-trip|highway-travel|road-movie|on-the-road |
| Feature | Outcome | Evaluation |
|---|---|---|
| imdbid | Eliminated | Determined to not be meaningful for a multiple linear regression approach. |
| title | Eliminated | Determined to not be meaningful for a multiple linear regression approach. |
| plot | Eliminated | Determined to not be meaningful for a multiple linear regression approach. |
| rating | Cleaned | Category needed cleaning, so we collapsed it into a smaller number of similar categories. |
| imdb_rating | Left As-Is | Used continuous variable without transformation |
| metacritic | Left As-Is | Used continuous variable without transformation |
| dvd_release | Eliminated | Determined to not be meaningful for a multiple linear regression approach. |
| production | Split & Cleaned | Kept only the first item in the comma-delimited list, then collapsed it into fewer categories. We grouped together companies whose names were misspelled or differed only slightly. |
| actors | Split | Split the first and second items in a comma separated list as Lead1 and Lead2. |
| imdb_votes | Left As-Is | Used continuous variable without transformation |
| poster | Eliminated | Determined to not be meaningful for a multiple linear regression approach. |
| director | Split | Split the first item in a comma separated list as Director. |
| release_date | Eliminated | Determined to not be meaningful for a multiple linear regression approach. |
| runtime | Left As-Is | Used continuous variable without transformation |
| genre | Split | Split the first and second items in a comma separated list as genre1 and genre2. |
| awards | Eliminated | Determined to not be meaningful for a multiple linear regression approach. |
| keywords | Eliminated | Determined to not be meaningful for a multiple linear regression approach. |
| Budget | Left As-Is | Used continuous variable without transformation |
| box office gross | Left As-Is | Used continuous variable without transformation |
| Box Office Season | Left As-Is | Used categorical variable without transformation |
| Season Gross | Left As-Is | Used continuous variable without transformation |
| Season YoY | Left As-Is | Used continuous variable without transformation |
| Season Days | Left As-Is | Used continuous variable without transformation |
| Season Daily Avg | Left As-Is | Used continuous variable without transformation |
| Season Movie Count | Left As-Is | Used continuous variable without transformation |
| Season Movie Avg Gross | Left As-Is | Used continuous variable without transformation |
| Year | Left As-Is | Used continuous variable without transformation |
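The "Split" outcomes in the table above can be illustrated with a small base R sketch. It is not the code used for the engagement: the helper `nth_item` is a hypothetical name, and the two rows are copied from the sample data shown earlier purely for illustration.

```r
# Two sample rows taken from the merged data shown earlier in this report
movies <- data.frame(
  actors = c("Jason Statham, Saffron Burrows, Stephen Campbell Moore, Daniel Mays",
             "Josh Alexander, Starla Benford, William Bogert, Robert T. Bogue"),
  genre  = c("Crime, Drama, Romance", "Comedy"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the n-th item of each comma-separated string,
# or NA when the list has fewer than n items
nth_item <- function(x, n) {
  parts <- strsplit(x, ",\\s*")
  vapply(parts,
         function(p) if (length(p) >= n) trimws(p[n]) else NA_character_,
         character(1))
}

movies$Lead1  <- nth_item(movies$actors, 1)
movies$Lead2  <- nth_item(movies$actors, 2)
movies$genre1 <- nth_item(movies$genre, 1)
movies$genre2 <- nth_item(movies$genre, 2)

print(movies[, c("Lead1", "Lead2", "genre1", "genre2")])
```

A movie with a single genre, such as the second row, ends up with `NA` in `genre2`, which then has to be handled like any other missing value.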
Our client asked us to focus strictly on their live-action feature films released in the US over the last 5 years. Therefore, we removed observations with the following attributes:
- International Movies: Movies from production companies outside the US were removed and were not factored into our exploratory data analysis.
- Genre filtering: Movies with a genre1 of ‘Animation’ or ‘Documentary’ were filtered out at the client’s request.
- Release Date filtering: Movies released before July 2012 were filtered out at the client’s request.
- Runtime filtering: Only movies with a runtime of 80 minutes or longer are officially recognized as feature films. Therefore, per the client’s requirements, any film with a runtime of less than 80 minutes was eliminated.
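The genre, release date, and runtime filters above can be sketched in base R as follows. This is a minimal illustration on hypothetical toy rows ("Film A" to "Film D" are made up), not the engagement code. The international filter is left out because the report does not state how a production company's country was determined.

```r
# Hypothetical toy rows in the same format as the source table
movies <- data.frame(
  title        = c("Film A", "Film B", "Film C", "Film D"),
  genre1       = c("Drama", "Animation", "Horror", "Comedy"),
  release_date = c("7-Mar-14", "6-Feb-15", "27-Jan-12", "28-Mar-16"),
  runtime      = c("111 min", "100 min", "96 min", "75 min"),
  stringsAsFactors = FALSE
)

# Parse "7-Mar-14" style dates; %b assumes English month abbreviations
movies$release <- as.Date(movies$release_date, format = "%d-%b-%y")

# Strip the " min" suffix so runtime can be compared numerically
movies$runtime_min <- as.numeric(sub(" min", "", movies$runtime, fixed = TRUE))

keep <- !(movies$genre1 %in% c("Animation", "Documentary")) &
  movies$release >= as.Date("2012-07-01") &
  movies$runtime_min >= 80

filtered <- movies[keep, ]
print(filtered$title)
```

One caveat when applying this to the full dataset: R's `%y` maps two-digit years 00 to 68 to 20xx, so an older title stored as `19-Mar-21` would be read as 2021 and would need its century corrected before a date filter like this one.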
Once we had filtered out the observations that were not to be used, we still had some missing values that needed to be dealt with. Here is how we handled the features with missing values:
- Budget & Box Office Gross: Any released movie whose budget and/or box office gross total could not be obtained was eliminated from the dataset. We did not feel these values could be reliably imputed because of their vast variance.
- IMDB rating, IMDB votes, & Metacritic Score: These values were fairly normally distributed, so we imputed missing values using each feature’s median value.
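Median imputation as described above can be sketched in a few lines of base R. The helper name `impute_median` and the five toy values are assumptions for illustration. The sketch also assumes missing scores arrive as the string "N/A", as they do in the sample rows shown earlier.

```r
# Toy scores; "N/A" marks a missing Metacritic value as in the source data
scores <- data.frame(
  imdb_rating = c(7.3, NA, 3.9, 5.6, 7.7),
  metacritic  = c("69", "N/A", "N/A", "33", "80"),
  stringsAsFactors = FALSE
)

# Convert "N/A" strings to real missing values before making the column numeric
scores$metacritic[scores$metacritic == "N/A"] <- NA
scores$metacritic <- as.numeric(scores$metacritic)

# Hypothetical helper: replace missing entries with the median of the observed ones
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

scores$imdb_rating <- impute_median(scores$imdb_rating)
scores$metacritic  <- impute_median(scores$metacritic)
print(scores)
```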
At VARS Consulting, our domain knowledge allows us to create meaningful features from source data to enhance the model input data and provide the best possible results. We engineered several features based on proprietary criteria so that very high cardinality categorical data such as Production Company, Director, & Actors could contribute meaningfully to the model. This is especially needed because past success for Directors and Actors can heavily influence future movie performance. Our methodology was as follows:
Box Office Performance Points
>Each movie was ranked according to lifetime box office performance. We then identified movies ranked 1-10 as Top 10 movies, movies ranked 11-50 as Top 50 movies, and movies ranked 51-250 as Top 250 movies. Each Production, Director, and Actor in the data set was then awarded 250 performance points for each movie in the Top 10, 50 performance points for each movie in the Top 50, and 10 performance points for each movie in the Top 250.
[Performance Points] = [ SUM(Top 10 Count) x 250 ] + [ SUM(Top 50 Count) x 50 ] + [ SUM(Top 250 Count) x 10 ]
| Director | Sum.of.Top.10 | Sum.of.Top.50 | Sum.of.Top.250 | Points |
|---|---|---|---|---|
| Christopher Nolan | 2 | 1 | 2 | 570 |
| Joss Whedon | 2 | 0 | 0 | 500 |
| Bill Condon | 1 | 2 | 0 | 350 |
| J.J. Abrams | 1 | 0 | 3 | 280 |
| Gareth Edwards | 1 | 0 | 1 | 260 |
| Andrew Stanton | 1 | 0 | 0 | 250 |
| Colin Trevorrow | 1 | 0 | 0 | 250 |
| James Cameron | 1 | 0 | 0 | 250 |
| Francis Lawrence | 0 | 3 | 0 | 150 |
| Jon Favreau | 0 | 3 | 0 | 150 |
| Michael Bay | 0 | 2 | 2 | 120 |
| David Yates | 0 | 2 | 1 | 110 |
| Zack Snyder | 0 | 2 | 1 | 110 |
| James Gunn | 0 | 2 | 0 | 100 |
| Clint Eastwood | 0 | 1 | 2 | 70 |
| Peter Jackson | 0 | 1 | 2 | 70 |
| Todd Phillips | 0 | 1 | 2 | 70 |
| Anthony Russo | 0 | 1 | 1 | 60 |
| Byron Howard | 0 | 1 | 1 | 60 |
| Chris Renaud | 0 | 1 | 1 | 60 |
| James Wan | 0 | 1 | 1 | 60 |
| Kyle Balda | 0 | 1 | 1 | 60 |
| Pierre Coffin | 0 | 1 | 1 | 60 |
| Steven Spielberg | 0 | 1 | 1 | 60 |
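As a check on the Performance Points formula, the short sketch below recomputes the points for four directors from the table above. The function name `performance_points` is an assumption. The counts are copied from the table.

```r
# Performance Points = Top 10 count x 250 + Top 50 count x 50 + Top 250 count x 10
performance_points <- function(top10, top50, top250) {
  top10 * 250 + top50 * 50 + top250 * 10
}

# Counts copied from the director table above
directors <- data.frame(
  Director       = c("Christopher Nolan", "Joss Whedon", "Bill Condon", "Michael Bay"),
  Sum.of.Top.10  = c(2, 2, 1, 0),
  Sum.of.Top.50  = c(1, 0, 2, 2),
  Sum.of.Top.250 = c(2, 0, 0, 2),
  stringsAsFactors = FALSE
)

directors$Points <- with(
  directors,
  performance_points(Sum.of.Top.10, Sum.of.Top.50, Sum.of.Top.250)
)
print(directors)
```

For example, Christopher Nolan's 2 Top 10, 1 Top 50, and 2 Top 250 movies give 2 x 250 + 1 x 50 + 2 x 10 = 570 points, matching the table.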
Finally, we brought our cleaned and prepared dataset into R to begin data analysis and model building.
| Box.Office.Gross | Season.Gross | Season.YoY.Change | Season.Days | Season.Daily.Avg | Season.Movie.Count | Season.Movie.Avg | Budget | runtime | imdb_rating | imdb_votes | metacritic | Director.Perf.Pts | Lead1.Perf.Points | Lead2.Perf.Points | Rating.Genre.Perf.Pts | Box.Office.Season | rating.group | genre1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 31537320 | 1397500000 | -0.059 | 55 | 25409091 | 113 | 12400000 | 1e+06 | 83 | 5.6 | 50796 | 59 | 0 | 0 | 0 | 0 | Spring | R | Drama |
| 64473115 | 4851100000 | 0.127 | 122 | 39763115 | 232 | 20900000 | 3e+06 | 85 | 5.7 | 155490 | 41 | 0 | 0 | 0 | 0 | Summer | R | Horror |
| 35385560 | 1117700000 | 0.017 | 62 | 18027419 | 103 | 10900000 | 4e+06 | 91 | 4.6 | 30304 | 30 | 0 | 0 | 0 | 10 | Winter | R | Mystery |
| 2184640 | 1277200000 | -0.161 | 58 | 22020690 | 133 | 9600000 | 5e+06 | 118 | 4.1 | 5359 | 42 | 10 | 0 | 0 | 60 | Fall | PG | Adventure |
| 65206105 | 1277200000 | -0.161 | 58 | 22020690 | 133 | 9600000 | 5e+06 | 94 | 6.2 | 82263 | 55 | 20 | 0 | 0 | 10 | Fall | PG-13 | Horror |
| 50856010 | 1522300000 | 0.325 | 65 | 23420000 | 159 | 9600000 | 5e+06 | 89 | 4.4 | 37506 | 38 | 0 | 0 | 0 | 10 | Fall | PG-13 | Horror |
Exploratory data analysis is an approach that utilizes various techniques to detect any mistakes, check underlying assumptions and roughly determine the relationship among the explanatory variables. Some EDA techniques are graphical in nature whereas some are quantitative.
Depending on the type of data that has to be explored, the exploratory data analysis can be of the following:
We reviewed box plots because they show robust measures of location and spread along with information about symmetry and outliers. Similarly, we reviewed histograms because they quickly depict the central tendency and modality of the data.
For our client, we focused the exploratory data analysis on selected numerical variables. We studied the runtime of the movies, the number of IMDb votes over the last 5 years, and the budget of every movie graphically. Here are some of our findings:
In addition, both the season the movie is released and the rating given by the MPAA can have a significant impact on the box office performance as seen in these graphs:
The combination of rating and season also creates significant variation in performance:
The EDA showed that the data was skewed for certain parameters and needed smoothing. We cannot always rely on data as provided by the source; therefore, we made some modifications and cleaned the data to make it suitable for analysis.
At first, we simply built a model that uses all of the data to evaluate the potential of a linear model for the client:
bo <- lm(Box.Office.Gross ~ ., data = all_vars)
summary(bo)
##
## Call:
## lm(formula = Box.Office.Gross ~ ., data = all_vars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60239091 -16845344 0 18832919 64086533
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.304e+08 3.563e+08 -0.647 0.52229
## Season.Gross -1.151e-01 8.783e-02 -1.310 0.19926
## Season.YoY.Change -1.007e+06 7.604e+07 -0.013 0.98951
## Season.Days 1.844e+06 4.938e+06 0.374 0.71114
## Season.Daily.Avg 1.099e+00 9.605e+00 0.114 0.90961
## Season.Movie.Count 1.049e+06 2.078e+06 0.505 0.61713
## Season.Movie.Avg 6.840e+00 2.059e+01 0.332 0.74190
## Budget 4.764e-01 2.618e-01 1.820 0.07791 .
## runtime 5.359e+05 6.344e+05 0.845 0.40434
## imdb_rating 5.441e+06 1.236e+07 0.440 0.66266
## imdb_votes 6.128e+01 8.942e+01 0.685 0.49795
## metacritic 3.425e+05 6.344e+05 0.540 0.59290
## Director.Perf.Pts 1.236e+06 7.323e+05 1.688 0.10085
## Lead1.Perf.Points 8.152e+05 4.478e+05 1.821 0.07776 .
## Lead2.Perf.Points -6.930e+04 4.283e+05 -0.162 0.87246
## Rating.Genre.Perf.Pts -3.369e+04 9.975e+03 -3.378 0.00189 **
## Box.Office.SeasonHoliday 7.215e+07 1.483e+08 0.487 0.62980
## Box.Office.SeasonSpring 7.077e+07 3.318e+07 2.133 0.04045 *
## Box.Office.SeasonSummer 9.416e+07 2.290e+08 0.411 0.68365
## Box.Office.SeasonWinter 4.336e+07 5.122e+07 0.846 0.40340
## rating.groupPG-13 -5.887e+07 7.615e+07 -0.773 0.44496
## rating.groupR -5.837e+07 7.395e+07 -0.789 0.43554
## genre1Adventure -6.989e+07 4.662e+07 -1.499 0.14331
## genre1Biography -4.037e+07 3.266e+07 -1.236 0.22528
## genre1Comedy -7.419e+06 1.961e+07 -0.378 0.70758
## genre1Crime -4.551e+07 2.931e+07 -1.553 0.13004
## genre1Drama -1.239e+07 2.790e+07 -0.444 0.65983
## genre1Horror 3.977e+07 2.717e+07 1.464 0.15272
## genre1Mystery 5.218e+06 4.768e+07 0.109 0.91351
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38920000 on 33 degrees of freedom
## Multiple R-squared: 0.9137, Adjusted R-squared: 0.8404
## F-statistic: 12.47 on 28 and 33 DF, p-value: 8.405e-11
par(mfrow = c(2, 2))
plot(bo)
The initial look suggests that the data structure is conducive to linear regression, but we need to find the simplest combination of explanatory variables that still produces acceptable results. To do this, we employ a best subset approach:
# BEGIN best subset approach
library(leaps)  # provides regsubsets()
best.subset <- regsubsets(Box.Office.Gross ~ ., data = all_vars, nvmax = 30, nbest = 10, really.big = TRUE)
best.subset.summary <- summary(best.subset)
# Show plots evaluating the results of best subset approach
# With nbest = 10 each point is one candidate model, so the x-axis is a model index
par(mfrow = c(2, 2))
plot(best.subset.summary$rss, xlab = "Model Index Number", ylab = "RSS", type = "l")
plot(best.subset.summary$adjr2, xlab = "Model Index Number", ylab = "Adjusted RSq", type = "l")
plot(best.subset.summary$cp, xlab = "Model Index Number", ylab = "CP", type = "l")
plot(best.subset.summary$bic, xlab = "Model Index Number", ylab = "BIC", type = "l")
In addition to the R-squared, Adjusted R-squared, Cp, & BIC values, we need to understand the VIF and Durbin-Watson values of each prospective model to make the best possible selection. VARS Consulting adds these values to the best subset data frame as follows:
bt <- best.subset.summary$which
best.subset.tests <- data.frame(vif=double(), dwtval=double(), dwpval=double())
for (i in seq_len(nrow(bt))) {
  # Start from the full data set and drop the predictors excluded from model i
  loop_df <- all_vars
  # Only the first 13 columns of bt (the intercept through Director.Perf.Pts)
  # are checked; all other columns of all_vars are always retained
  for (j in 1:13) {
    if (!bt[i, j]) {
      loop_df <- loop_df[, !names(loop_df) %in% colnames(bt)[j], drop = FALSE]
    }
  }
  # -999 marks models for which the tests could not be computed
  vif_val <- -999
  dwt_val <- -999
  dwp_val <- -999
  tryCatch({
    # Fit the candidate model once and reuse it for both tests
    fit <- lm(Box.Office.Gross ~ ., data = loop_df)
    # Column 3 of the vif() output is GVIF^(1/(2*Df))
    vif_val <- max(vif(fit)[, 3])
    dw <- durbinWatsonTest(fit)
    dwt_val <- dw$dw
    dwp_val <- dw$p
  }, error = function(e) {})
  best.subset.tests[i, 1] <- vif_val
  best.subset.tests[i, 2] <- dwt_val
  best.subset.tests[i, 3] <- dwp_val
}
all_results <- data.frame(best.subset.summary$rsq, best.subset.summary$adjr2, best.subset.summary$cp, best.subset.summary$bic, best.subset.tests)
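To make the two diagnostics concrete, the sketch below computes both by hand in base R. It uses the built-in `mtcars` data as a stand-in for `all_vars`, so the model, the predictor names, and the objects `vif_by_hand` and `dw_by_hand` are illustrative assumptions rather than part of the analysis above:

```r
# Stand-in model: mtcars replaces all_vars (illustrative only).
predictors <- c("wt", "hp", "disp")
fit <- lm(reformulate(predictors, response = "mpg"), data = mtcars)

# VIF for predictor j is 1 / (1 - R_j^2), where R_j^2 comes from
# regressing predictor j on the remaining predictors.
vif_by_hand <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = mtcars))$r.squared
  1 / (1 - r2)
})

# Durbin-Watson statistic: sum of squared differences between successive
# residuals divided by the residual sum of squares. Values near 2 suggest
# little first-order autocorrelation.
e <- residuals(fit)
dw_by_hand <- sum(diff(e)^2) / sum(e^2)

vif_by_hand
dw_by_hand
```

For models that contain factor terms, `vif()` from the `car` package reports a generalized VIF instead, and its third column, `GVIF^(1/(2*Df))`, equals the square root of the ordinary VIF for any term with a single degree of freedom. A cutoff applied to that column is therefore on a different scale than a cutoff applied to the ordinary VIF.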
Next, we filter the data frame to the prospective models that meet all of our thresholds: a VIF value below 5, R-squared and adjusted R-squared above 0.7, and a Durbin-Watson statistic between 1.95 and 2.05:
# Filter data frame to find models that meet all criteria
all_results[all_results$vif > 0 & all_results$vif < 5 &
            all_results$best.subset.summary.rsq > 0.7 &
            all_results$best.subset.summary.adjr2 > 0.7 &
            all_results$dwtval > 1.95 & all_results$dwtval < 2.05, ]
## best.subset.summary.rsq best.subset.summary.adjr2
## 82 0.8777332 0.8565717
## 89 0.8739507 0.8521345
## best.subset.summary.cp best.subset.summary.bic vif dwtval dwpval
## 82 4.737516 -89.02476 4.928035 2.029804 0.828
## 89 6.183427 -87.13574 4.610032 2.032820 0.836
# Show variables of selected model
best.subset.summary$which[82,]
## (Intercept) Season.Gross Season.YoY.Change
## TRUE FALSE FALSE
## Season.Days Season.Daily.Avg Season.Movie.Count
## FALSE TRUE FALSE
## Season.Movie.Avg Budget runtime
## FALSE TRUE FALSE
## imdb_rating imdb_votes metacritic
## FALSE FALSE FALSE
## Director.Perf.Pts Lead1.Perf.Points Lead2.Perf.Points
## TRUE TRUE FALSE
## Rating.Genre.Perf.Pts Box.Office.SeasonHoliday Box.Office.SeasonSpring
## TRUE TRUE TRUE
## Box.Office.SeasonSummer Box.Office.SeasonWinter rating.groupPG-13
## TRUE FALSE FALSE
## rating.groupR genre1Adventure genre1Biography
## FALSE FALSE FALSE
## genre1Comedy genre1Crime genre1Drama
## FALSE FALSE FALSE
## genre1Horror genre1Mystery
## TRUE FALSE
Two models pass the filter, and we select model 82 because it has the lower Cp and BIC of the two. Based on our proprietary best subset approach, the simplest model that performs best includes:
- Season Daily Average
- Budget
- Director Performance Points
- Lead Actor 1 Performance Points
- Rating Genre Performance Points
- Box Office Season
- Genre 1
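This confirms that VARS Consulting's feature engineering efforts added significant value to the final model. Although the best subset search selected only some of the dummy variables for Box Office Season and Genre 1, R's lm function treats each factor as a unit, so the final model below includes every level of both factors.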
bo4 <- lm(Box.Office.Gross ~
Season.Daily.Avg
+ Budget
+ Director.Perf.Pts
+ Lead1.Perf.Points
+ Rating.Genre.Perf.Pts
+ Box.Office.Season
+ genre1
,data = all_vars)
summary(bo4)
##
## Call:
## lm(formula = Box.Office.Gross ~ Season.Daily.Avg + Budget + Director.Perf.Pts +
## Lead1.Perf.Points + Rating.Genre.Perf.Pts + Box.Office.Season +
## genre1, data = all_vars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -92277096 -15242764 -4034985 18327847 73913164
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.552e+08 5.301e+07 2.927 0.005346 **
## Season.Daily.Avg -6.046e+00 2.314e+00 -2.613 0.012154 *
## Budget 4.714e-01 1.678e-01 2.810 0.007309 **
## Director.Perf.Pts 1.028e+06 5.261e+05 1.954 0.056929 .
## Lead1.Perf.Points 1.033e+06 3.784e+05 2.729 0.009028 **
## Rating.Genre.Perf.Pts -3.191e+04 8.196e+03 -3.894 0.000324 ***
## Box.Office.SeasonHoliday 1.501e+08 5.615e+07 2.672 0.010450 *
## Box.Office.SeasonSpring 8.214e+07 2.451e+07 3.351 0.001637 **
## Box.Office.SeasonSummer 1.202e+08 3.465e+07 3.469 0.001164 **
## Box.Office.SeasonWinter 7.917e+06 1.809e+07 0.438 0.663766
## genre1Adventure -4.951e+07 2.932e+07 -1.688 0.098248 .
## genre1Biography -1.046e+07 2.123e+07 -0.493 0.624710
## genre1Comedy -2.789e+06 1.589e+07 -0.176 0.861431
## genre1Crime -3.805e+07 2.233e+07 -1.704 0.095233 .
## genre1Drama -2.419e+07 2.260e+07 -1.070 0.290135
## genre1Horror 3.516e+07 2.132e+07 1.649 0.106040
## genre1Mystery -2.027e+07 4.014e+07 -0.505 0.616040
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36940000 on 45 degrees of freedom
## Multiple R-squared: 0.8939, Adjusted R-squared: 0.8562
## F-statistic: 23.7 on 16 and 45 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(bo4)
vif(bo4)
## GVIF Df GVIF^(1/(2*Df))
## Season.Daily.Avg 22.273033 1 4.719431
## Budget 3.974359 1 1.993579
## Director.Perf.Pts 13.203012 1 3.633595
## Lead1.Perf.Points 14.529188 1 3.811717
## Rating.Genre.Perf.Pts 2.649588 1 1.627756
## Box.Office.Season 38.563123 4 1.578598
## genre1 4.034130 7 1.104760
durbinWatsonTest(bo4)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.01069427 2.00018 0.668
## Alternative hypothesis: rho != 0
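The adjusted R-squared values printed by `summary()` follow from the R-squared, the number of observations, and the number of predictors. A short base R check using only figures printed in the two model summaries (the helper name `adj_r2` is ours and purely illustrative):

```r
# Adjusted R-squared: 1 - (1 - R2) * (n - 1) / (n - p - 1),
# where p is the number of predictors (coefficients excluding the intercept).
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Observations = residual df + predictors + intercept.
n <- 45 + 16 + 1

adj_r2(0.9137, n, 28)  # full model: 28 predictors, 33 residual df
adj_r2(0.8939, n, 16)  # selected model bo4: 16 predictors, 45 residual df
```

Both results agree with the printed values (0.8404 and 0.8562) to about three decimal places; the small differences come from using the rounded R-squared figures as inputs. The comparison also shows why the smaller model is preferred: dropping 12 coefficients lowers R-squared slightly but raises adjusted R-squared.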
Next, we plot predicted against actual box office gross to evaluate how close the model's predictions are to the actual data:
par(mfrow = c(1, 1))
plot(predict(bo4),all_vars$Box.Office.Gross,
xlab="Predicted Box Office Gross $",ylab="Actual Box Office Gross $")
abline(a=0,b=1)
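A predicted-versus-actual plot drawn on the training data tends to look optimistic, because the model has already seen every point. One possible complement, sketched here on the built-in `mtcars` data rather than on `all_vars` (so the model and object names are assumptions for illustration), is the leave-one-out error, which for a linear model follows directly from the residuals and leverages without refitting:

```r
# Stand-in model: mtcars replaces all_vars (illustrative only).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# In-sample error: residuals from the model fitted on all rows.
rmse_in <- sqrt(mean(residuals(fit)^2))

# Leave-one-out residuals for a linear model: e_i / (1 - h_ii),
# where h_ii is the leverage (hat value) of row i.
loo_resid <- residuals(fit) / (1 - hatvalues(fit))
rmse_loo <- sqrt(mean(loo_resid^2))

c(in_sample = rmse_in, leave_one_out = rmse_loo)
```

The leave-one-out figure is never smaller than the in-sample figure, and a large gap between the two would suggest that the model is fitting noise in the training data.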
Using the selected model, we predict the box office gross for the client's upcoming theatrical releases. The input data for these releases is:
| Box.Office.Gross | Season.Gross | Season.YoY.Change | Season.Days | Season.Daily.Avg | Season.Movie.Count | Season.Movie.Avg | Budget | runtime | imdb_rating | imdb_votes | metacritic | Director.Perf.Pts | Lead1.Perf.Points | Lead2.Perf.Points | Rating.Genre.Perf.Pts | Box.Office.Season | rating.group | genre1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NA | -0.006 | 58 | 19.6 | 96 | 11.9 | 0 | 105 | 6.3 | 14496 | 53 | 0 | 0 | 0 | 2570 | Winter | PG-13 | Action |
| 0 | NA | -0.152 | 122 | 31.0 | 245 | 15.4 | 0 | 105 | 6.3 | 14496 | 53 | 0 | 0 | 0 | 60 | Summer | R | Horror |
| 0 | NA | 0.103 | 58 | 20.8 | 157 | 7.7 | 0 | 105 | 6.3 | 14496 | 53 | 0 | 0 | 0 | 40 | Fall | PG-13 | Comedy |
| 0 | NA | -0.672 | 59 | 16.0 | 64 | 14.8 | 0 | 105 | 6.3 | 14496 | 53 | 0 | 10 | 0 | 40 | Holiday | R | Adventure |
Finally, we predict future box office revenue using our model:
predict(bo4,feat)
## 1 2 3 4
## 81059502 308588176 151097635 264754577
- A 105-minute action movie released in the winter season is expected to gross $81,059,502.
- A 105-minute horror movie released in the summer season is expected to gross $308,588,176.
- A 105-minute comedy movie released in the fall season is expected to gross $151,097,635.
- A 105-minute adventure movie released in the holiday season is expected to gross $264,754,577.
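The figures above are point predictions. `predict()` can also return an interval around each prediction, which may help the client judge how much weight to place on a single number. A minimal sketch on the built-in `mtcars` data, standing in for `all_vars` and `feat` (the data, model, and object names are illustrative assumptions):

```r
# Stand-in training data with one factor predictor, as in the report's model.
cars <- transform(mtcars, cyl = factor(cyl))
fit <- lm(mpg ~ wt + hp + cyl, data = cars)

# New observations must use the same factor levels as the training data.
new_cars <- data.frame(
  wt  = c(2.5, 3.5),
  hp  = c(100, 180),
  cyl = factor(c("4", "8"), levels = levels(cars$cyl))
)

# Point prediction (fit) with lower and upper 95% prediction bounds.
pred <- predict(fit, newdata = new_cars, interval = "prediction", level = 0.95)
pred
```

Applied to the report's model, the equivalent call would be `predict(bo4, feat, interval = "prediction")`, provided the factor columns in `feat` use the same levels as those in `all_vars`.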